Abstract

We present the first tree-based regressor whose con- 
vergence rate depends only on the intrinsic dimen- 
sion of the data, namely its Assouad dimension. 
The regressor uses the RPtree partitioning proce- 
dure, a simple randomized variant of k-d trees. 



1 Introduction 

Non-parametric learning algorithms tend to suffer from what 
is referred to as the curse of dimensionality, namely that pre- 
diction performance deteriorates dramatically as the number 
of features increases. This phenomenon is quantifiable in the 
case of regression algorithms: as initially shown by Stone
[Sto80, Sto82], if we only assume that the regression function
$f(x)$ is Lipschitz in $\mathbb{R}^D$, then no non-parametric
estimator can achieve a convergence rate faster than
$n^{-2/(2+D)}$ (see footnote 1).
In other words, the number of points required to attain a low 
risk may be exponential in D, and this is infeasible even for 
moderate values of D. 

However, it is often the case that data which appears high 
dimensional, actually conforms to a structure of low intrinsic 
dimensionality (interpreted broadly). Examples of such sit- 
uations are traditional continuous settings where the data is 
close to a low dimensional submanifold of $\mathbb{R}^D$, and discrete
settings such as when the data is sparse. These are all examples
of data with low Assouad dimension (see Definition 1);
this notion of dimension thus offers a natural and broad
model of intrinsic data complexity. 

We show that, for any input data distribution, the risk of a 
regressor based on RPtree (a variant of k-d tree) depends just 
on the unknown Assouad dimension of the data, regardless 
of the ambient dimension D. This is the first such result for 
tree-based regression. 

1.1 Tree-based regression 

Tree-based regression consists of first building a hierarchy 
of nested partitions of the data space (the tree), and then 
learning a piecewise continuous function $f_n$ over the cells
of some chosen partition in the hierarchy. Future evaluations
of $f_n(x)$ can be done in time just $O(\log n)$ by navigating
the usually shallow tree down to an appropriate cell. These
methods are popular due to their ease of use and computational
efficiency (e.g. CART, dyadic trees, k-d trees; see
[GN05, SN06, LGL96]), but none has been shown to adapt
to intrinsic dimensionality in terms of their regression risk.
See Figure 1 for some examples.

Figure 1: Spatial partitioning induced by various splitting
rules: (a) dyadic tree, (b) k-d tree, (c) RPtree. Two levels of
the tree are shown for each.

1 Stone's result concerns a much larger class of regression
functions; here we focus on Lipschitz conditions.


The Random Projection tree (RPtree) is a hierarchical 
partitioning procedure which recursively bisects the data space 
with random hyperplanes (see Figure 1(c)). Although RPtree's
connections to intrinsic data dimensionality have been
studied in unsupervised settings ([DF08, GLZ08]), its use for
regression has not been explored. 
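
As a rough illustration of what "bisecting with a random hyperplane" means, the sketch below draws a direction uniformly from the unit sphere and splits a point set at the median projection. The median threshold and the helper names are assumptions made for this example only; the actual RPtree split rule is the one analyzed in [DF08].

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def random_hyperplane_split(points, rng):
    """Bisect `points` with a hyperplane normal to a random direction.

    Assumption (not from the paper): the threshold is the lower median
    of the projections onto the random direction.
    """
    direction = random_unit_vector(len(points[0]), rng)
    proj = [sum(pk * dk for pk, dk in zip(p, direction)) for p in points]
    threshold = sorted(proj)[(len(proj) - 1) // 2]
    left = [p for p, t in zip(points, proj) if t <= threshold]
    right = [p for p, t in zip(points, proj) if t > threshold]
    return left, right, direction

rng = random.Random(0)
sample = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(101)]
left, right, direction = random_hyperplane_split(sample, rng)
```

Applying such a split recursively to each side yields a hierarchy of nested cells whose shapes are irregular polytopes rather than axis-parallel boxes.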

Using RPtrees for regression requires a method for selecting
a partition on which to learn the regressor $f_n$. Selecting
a good partition from the hierarchy is essential to balancing
the bias and variance of the regressor. Traditional
methods use penalized empirical risk minimization over all 
possible partitions induced by the tree. Our approach can be 
more efficient in practice. We grow the tree in careful steps 
that enable us to quickly identify a small set of candidate 
partitions. We then provide a couple of options for selecting
the final partition: one is to use cross-validation over the
candidate partitions, another is a criterion which allows us to
automatically stop growing the tree when a good partition
is attained. The latter method is computationally cheaper,
while the former method results in a slightly better risk. In
both cases, the excess risk of the RPtree regressor depends
just on the unknown Assouad dimension of the input space,
for all distributions.

On the technical side, RPtree regression requires novel 
techniques for analyzing the bias of the estimator. Estima- 
tor bias is well understood to decrease with the diameters of 
the partition's cells. Unfortunately these physical diameters 
are hard to assess for RPtrees given the random and irregular
shapes of the cells, and in fact they may not decrease at
all. However, we can track the diameters of the data within 
the cells, and we develop new techniques to relate these em- 
pirical data diameters to the estimator's bias. We believe 
these techniques are of independent interest as they take fo- 
cus away from the cells' physical diameters, thus opening 
the door to richer partitioning rules whose cell diameters are 
hard to control. 

1.2 Background and related work 

The realization that data is often less complex than indicated
by the ambient dimension has spurred a significant body of
work (referred to as manifold learning) that aims to embed
the data into a low dimensional Euclidean space (see e.g.
[RS00, BN03, TSL00]). A possible approach to regression
on high dimensional data is to first reduce dimension using 
manifold learning and learn the regressor in the new space. 
Unfortunately, this approach is not guaranteed to work since 
pertinent information may be lost by the embedding. This 
raises the following natural question: can learning methods
such as regression adapt automatically to data that has low
intrinsic dimensionality while operating in the original space
$\mathbb{R}^D$?

An important result in the direction of adaptive regression
is the realization by Bickel and Li [BL06] that standard
kernel regressors are adaptive in the following sense: there
exists an appropriate bandwidth setting such that the asymptotic
pointwise risk at $x \in \mathbb{R}^D$ depends just on the manifold
dimension and on the behavior of the kernel in a neighborhood
of $x$. One then has to search for the appropriate bandwidth
setting, either by estimating the manifold dimension
or through cross-validation over all possible values of this
dimension (see e.g. [BL06, LW07]).

Kernel regressors can be expensive in practice: the kernel 
weights must be computed anew at each training point in 
order to evaluate the regressor on a new data point. This 
translates into an evaluation time of $\Omega(n)$, which is often a
burden given large samples. Contrast this with the $O(\log n)$
evaluation time of tree-based regressors. 

In the case of classification, a recent result by Scott and 
Nowak ([SN06J) for dyadic decision trees is related: they 
show that if the input data is drawn from an approximately 
uniform measure on a manifold, and the Bayes decision bound- 
ary is sufficiently smooth, DDTs achieve classification rates 
that depend just on the manifold dimension. It is unclear 
whether their result will apply in a distribution free regres- 
sion setting. 

2 Detailed overview of results 

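We're given i.i.d. training data $(\mathbf{X}, \mathbf{Y}) = \{(X_i, Y_i)\}_{i=1}^n \in
(\mathcal{X} \times \mathcal{Y})^n$, where the input space $\mathcal{X} \subset \mathbb{R}^D$ is contained
in a ball (see footnote 2) of (unknown) diameter $\Delta_\mathcal{X}$, and the output space
$\mathcal{Y} \subset \mathbb{R}^{D'}$ is contained in a ball of (unknown) diameter $\Delta_\mathcal{Y}$.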

2.1 Assouad dimension 

We model the intrinsic dimensionality of the space X using 
the notion of Assouad dimension defined below. 

2 We assume the Euclidean ($\ell_2$) norm in this work.




(a) Sparse data set. (b) 2-d manifold. 

Figure 2: Examples of data with low Assouad dimension. 

Definition 1 The Assouad dimension (or doubling dimension)
of $\mathcal{X} \subset \mathbb{R}^D$ is the smallest $d$ such that for any ball
$B \subset \mathbb{R}^D$, the set $B \cap \mathcal{X}$ can be covered by $2^d$ balls of half
the radius of $B$.

The Assouad dimension has proved useful in capturing 
the intrinsic complexity of data spaces as shown in various 
works on data analysis (see e.g. [IN07, BKL06, Cla05]).
It coincides with the natural notions of dimension of various
geometric objects: it is easy to see that $d$-dimensional
cubes and spheres all have Assouad dimension $O(d)$ (see e.g.
[Cla05]). It also captures notions of data complexity that are
standard in the machine learning and statistics communities; 
this is stated in the following remarks for emphasis. 

Remark 1 A $d$-dimensional hyperplane in $\mathbb{R}^D$ has Assouad
dimension $O(d)$ (see [Cla05]).

Remark 2 A $d$-dimensional Riemannian submanifold of $\mathbb{R}^D$
has Assouad dimension $O(d)$, subject to a bound on its curvature
(see Theorem 22 of [DF08]).

Remark 3 A $d$-sparse data space in $\mathbb{R}^D$, i.e. one where each
data point has at most $d$ nonzero coordinates, has Assouad
dimension $O(d \log D)$: it can be described by $\binom{D}{d} \le D^d$
hyperplanes of dimension $d$.
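
The counting step in Remark 3 is easy to check numerically. The sketch below (the function name and the values of D and d are chosen only for illustration) counts the coordinate subspaces that together contain every d-sparse point and confirms that the logarithm of this count is at most d log D. Heuristically, covering each subspace separately multiplies covering numbers by the number of subspaces, which is where the extra log D factor in the dimension comes from.

```python
import math

def sparse_subspace_count(D, d):
    """Number of d-dimensional coordinate subspaces of R^D.

    Every point with at most d nonzero coordinates lies in at least
    one of these subspaces.
    """
    return math.comb(D, d)

D, d = 1000, 5
count = sparse_subspace_count(D, d)
within_bound = count <= D ** d
log_within_bound = math.log2(count) <= d * math.log2(D)
```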

2.2 Notions of diameter 

Let $\mathbf{A}$ be some partition of $\mathcal{X}$. Traditionally, bias analysis revolves
around the physical diameters $\Delta(A) = \max_{x, x' \in A} \|x - x'\|$
of cells $A \in \mathbf{A}$ (see e.g. [GN05, SN06, LGL96]). In this
work we instead relate bias to the data diameters of the cells,
that is $\Delta_n(A) = \max_{x, x' \in A \cap \mathbf{X}} \|x - x'\|$, or $0$ if $A \cap \mathbf{X} = \emptyset$.





Focusing on data diameter has the following advantage. 
We never need to evaluate the physical diameters of the cells, 
and these need not decrease. Consequently, we don't have to 



constrain the partition to regular shaped cells (e.g. axis par- 
allel hyper-rectangles) whose physical diameters are easily 
controlled. In particular, it opens the door to richer parti- 
tioning rules such as RPtree which adapt better to the data 
complexity at the expense of creating irregular cells. We ex- 
pand on this last point in the example below. 
Consider a data space of the following form: 

$\bigcup_{i \neq j} \{ t e_i \pm \epsilon e_j : t \in [-1, 1] \}$, $i, j \in [D]$, for a fixed $\epsilon \ll 1$.

This is an extreme case of a noisy sparse data set of Assouad
dimension $O(\log D)$, depicted in Figure 2(a). We'd like to
partition this space in a way that reduces the data diame- 
ters of the cells (for low estimator bias) while achieving a 
small partition size (for low estimator variance). Axis par- 
allel splitting rules such as k-d trees or dyadic trees would 
require a number of cells exponential in D in order to halve 
the diameters. Yet, the set itself can be partitioned into at 
most $2D^2$ cells of half its radius. The richness of random
splits allows us to achieve a partitioning just a bit larger than
this, even in the worst case over distributions on the set. In
fact, given any data set of Assouad dimension $d$, RPtrees are
guaranteed to achieve a partition of size at most $2^{O(d \log d)}$, such
that the data diameter of each cell is at most half of the diameter
of the full data set. We refer the reader to [DF08] for
a detailed analysis. 

We'll soon see that, for low estimator bias, we don't 
need every cell of a partition to have small data diameter, 
but rather that these diameters are small in an average sense. 
Given a collection A of disjoint subsets of X, we define the 
following notion of average data diameter: 



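$$\Delta_n(\mathbf{A}) = \left( \frac{\sum_{A \in \mathbf{A}} \mu_n(A)\, \Delta_n^2(A)}{\sum_{A \in \mathbf{A}} \mu_n(A)} \right)^{1/2},$$

where $\mu_n$ is the empirical measure over $\mathcal{X}$ (we'll let $\mu$ denote
the marginal measure over $\mathcal{X}$).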
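
To make the two notions concrete, here is a minimal sketch that computes the data diameter of a cell and a mass-weighted root-mean-square average over a collection of cells. It assumes the average is normalized by the total empirical mass of the collection; cells are represented simply as lists of the sample points they contain, which is a choice made for this example.

```python
import math
from itertools import combinations

def data_diameter(points):
    """Largest pairwise Euclidean distance; 0 for fewer than two points."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def average_data_diameter(cells, n):
    """Mass-weighted RMS data diameter of a collection of disjoint cells.

    `cells` is a list of lists of sample points; `n` is the sample size,
    so len(cell) / n is the empirical mass of a cell.
    """
    masses = [len(cell) / n for cell in cells]
    total = sum(masses)
    if total == 0:
        return 0.0
    weighted = sum(m * data_diameter(c) ** 2 for m, c in zip(masses, cells))
    return math.sqrt(weighted / total)

cells = [[(0.0, 0.0), (3.0, 4.0)], [(10.0, 10.0)], []]
single = data_diameter(cells[0])
average = average_data_diameter(cells, 3)
```

Note that a cell holding a single point, or no point at all, contributes nothing to the average regardless of how large the cell is physically.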

2.3 Regression setup 

We assume that the regression function $f(x) = \mathbb{E}[Y \mid X = x]$
is $\lambda$-Lipschitz, for an unknown parameter $\lambda$:

$$\forall x, x' \in \mathcal{X}, \quad \|f(x) - f(x')\| \le \lambda \|x - x'\|.$$

For any function $g : \mathcal{X} \to \mathcal{Y}$, the $\ell_2$ pointwise risk at
$x$ satisfies

$$R(g(x)) = \mathbb{E}_{Y \mid X = x} \|Y - g(x)\|^2 = R(f(x)) + \|f(x) - g(x)\|^2,$$

and the integrated risk can then be written as

$$R(g) = \mathbb{E}_X R(g(X)) = R(f) + \mathbb{E}_X \|f(X) - g(X)\|^2.$$

Thus, the pointwise excess risk of $g(x)$ over $f(x)$ is simply
$\|f(x) - g(x)\|^2$. In this paper we'll be interested in the integrated
excess risk

$$\|f - g\|^2 = R(g) - R(f) = \mathbb{E}_X \|f(X) - g(X)\|^2.$$
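
The pointwise identity follows by expanding the square around $f(x)$; the cross term vanishes because $f(x)$ is the conditional mean of $Y$. A short derivation, writing $\langle \cdot, \cdot \rangle$ for the Euclidean inner product (notation introduced here for the example):

```latex
\begin{align*}
\mathbb{E}\left[\|Y - g(x)\|^2 \mid X = x\right]
  &= \mathbb{E}\left[\|Y - f(x)\|^2 \mid X = x\right]
   + 2 \big\langle \mathbb{E}\left[Y - f(x) \mid X = x\right],\, f(x) - g(x) \big\rangle
   + \|f(x) - g(x)\|^2 \\
  &= R(f(x)) + 0 + \|f(x) - g(x)\|^2 ,
\end{align*}
% since E[Y - f(x) | X = x] = 0 by definition of f.
```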

2.4 Choosing a good partition for regression 

A tree-based regressor works in two phases. The partition- 
ing phase returns a partition A of the data space X and a 
final regressor is learned as a piecewise continuous function 
over the cells of A. In this work we'll consider a piecewise 



constant regressor over the returned partition A defined as 
follows: 

For $x \in \mathcal{X}$, let $A(x)$ be the cell of $\mathbf{A}$ to which $x$ belongs. If
$\mu_n(A(x)) > 0$, the regressor is obtained as

$$f_{n,\mathbf{A}}(x) = \frac{\sum_{i=1}^n Y_i \cdot \mathbf{1}\{X_i \in A(x)\}}{n \cdot \mu_n(A(x))},$$

otherwise use a default setting $f_{n,\mathbf{A}}(x) = y_0 \in \mathcal{Y}$ whenever
$A(x)$ is empty of training points. We'll often refer to the
final regressor as $f_n(\cdot)$ as long as the partition used for the
estimate is clear from context.
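
The piecewise constant regressor is just a per-cell average of the training responses. A minimal sketch, assuming scalar responses and a hypothetical `cell_of` function that stands in for the tree lookup of the cell containing a point:

```python
from collections import defaultdict

def fit_piecewise_constant(xs, ys, cell_of, default=0.0):
    """Return a regressor that averages the training responses in each cell.

    `cell_of` maps an input to a hashable cell identifier (a hypothetical
    stand-in for navigating the tree); `default` plays the role of y_0 and
    is returned for cells that contain no training point.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        cell = cell_of(x)
        sums[cell] += y
        counts[cell] += 1

    def predict(x):
        cell = cell_of(x)
        if counts.get(cell, 0) == 0:
            return default
        return sums[cell] / counts[cell]

    return predict

# Toy partition of [0, 1) into four equal intervals.
def quarter_cell(x):
    return int(x * 4)

f_n = fit_piecewise_constant([0.1, 0.2, 0.6], [1.0, 3.0, 5.0], quarter_cell)
predictions = (f_n(0.05), f_n(0.7), f_n(0.9))  # (2.0, 5.0, 0.0)
```

Evaluation only needs the cell lookup and one division, which is what makes tree-based regressors cheap at prediction time.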

Procedure adaptiveRPtree makes calls to the subprocedure
coreRPtree, which implements the basic RPtree
splits. We defer the complete treatment of this subprocedure
to Section 5.1 since most of the analysis will concern
adaptiveRPtree. For now, note that the call to coreRPtree
returns a subtree rooted at $A$ with the following property: letting
$\mathbf{A}$ be the collection of subsets of $\mathbb{R}^D$ defined by the leaves
of this subtree, we have $\Delta_n(\mathbf{A}) \le \Delta_n(A)/2$. Also, the
implementation of coreRPtree ensures that the final tree built
by adaptiveRPtree has height at most $6 \log n$.

Procedure adaptiveRPtree grows the tree in steps $\mathbf{A}^0,
\mathbf{A}^1, \ldots$, where $\Delta_n(\mathbf{A}^{i+1}) \le \Delta_n(\mathbf{A}^i)/2$, and eventually
returns one of the partitions $\mathbf{A}^i$ for some $i$. We present a
couple of options for selecting a good partition to return. The
first option uses cross-validation: grow a large tree and prune
it back by minimizing empirical risks over an i.i.d. test sample
$(\mathbf{X}', \mathbf{Y}')$ of size $n$. The other option is that of automatic
stopping: we return a partition as soon as some stopping condition
is met.

The two options for selecting the return partition are outlined
in procedure adaptiveRPtree. The empirical risk in
the cross-validation option is defined as

$$R'_n(g) = \frac{1}{n} \sum_{i \in [n]} \|Y'_i - g(X'_i)\|^2 .$$


The automatic stopping option returns one of two par- 
titions and requires no test sample. It is a computationally 
faster option and, as we'll see, the resulting bounds are only 
marginally worsened. 
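
The cross-validation option amounts to evaluating each candidate regressor on the held-out sample and keeping the one with the smallest empirical risk. A minimal sketch with scalar responses; the candidate regressors and test data below are made up for illustration:

```python
def empirical_risk(regressor, test_xs, test_ys):
    """Mean squared error of `regressor` on a held-out sample."""
    total = sum((y - regressor(x)) ** 2 for x, y in zip(test_xs, test_ys))
    return total / len(test_xs)

def select_by_cross_validation(regressors, test_xs, test_ys):
    """Index of the candidate with the smallest held-out empirical risk."""
    risks = [empirical_risk(g, test_xs, test_ys) for g in regressors]
    return min(range(len(risks)), key=risks.__getitem__)

# Three hypothetical candidates, standing in for the regressors
# fitted on successive partitions of the tree.
candidates = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
test_xs = [0.0, 1.0, 2.0]
test_ys = [0.1, 0.9, 2.1]
best = select_by_cross_validation(candidates, test_xs, test_ys)  # 1
```

Since the tree has logarithmic height, only a logarithmic number of candidates needs to be compared, so the selection step is cheap relative to growing the tree.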

2.5 Main Results 

Definition 2 Given a sample $\mathbf{X}$, we say that adaptiveRPtree
attains a diameter decrease rate of $k$ on $\mathbf{X}$, for $k \ge d$, if every
call to the subprocedure coreRPtree$(A, \Delta_n(A)/2, \delta, l)$ in
the second loop of the procedure returns a tree rooted at $A$
of depth at most $k$.

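Theorem 3 Assume that $\mathcal{X}$ has Assouad dimension $d$. There
exist constants $C, C'$ independent of $d$ and $\mu(\mathcal{X})$, such that
the following holds.

Suppose the cross-validation option is used. Define

$$\alpha(n) = (\log^2 n) \log\log(n/\delta) + \log(1/\delta),$$

and assume $n \ge \max\{(\lambda \Delta_\mathcal{X} / \Delta_\mathcal{Y})^2, \alpha(n)\}$. With probability
at least $1 - \delta$, the algorithm attains a diameter decrease
rate of $k \le C' d \log d$, and the excess risk of the regressor
satisfies

$$\|f_n - f\|^2 \le C \cdot (\lambda \Delta_\mathcal{X})^{2k/(2+k)} \left( \frac{\Delta_\mathcal{Y}^2\, \alpha(n)}{n} \right)^{2/(2+k)} + 2 \Delta_\mathcal{Y}^2 \sqrt{\frac{\ln \log n^6 + \ln(3/\delta)}{2n}}.$$

Procedure adaptiveRPtree(sample $\mathbf{X}$, confidence parameter $\delta$)

    $\mathbf{A}^0 \leftarrow \{\mathcal{X}\}$;
    for $i \leftarrow 1$ to $\infty$ do
        foreach cell $A \in \mathbf{A}^{i-1}$ do
            // Create a subtree rooted at A:
            $l \leftarrow$ level($A$) in the current tree;   // Root is at level 0
            (subtree rooted at $A$) $\leftarrow$ coreRPtree($A$, $\Delta_n(A)/2$, $\delta$, $l$);
        end
        $\mathbf{A}^i \leftarrow$ partition of $\mathcal{X}$ defined by the leaves of the current tree;
        level($\mathbf{A}^i$) $\leftarrow \max_{A \in \mathbf{A}^i}$ level($A$);

        // At this point we have two options for stopping and returning a partition.

        Option 1: Cross-validation
        if $\Delta_n(\mathbf{A}^i) = 0$ or level($\mathbf{A}^i$) $\ge \log n^2$ then
            Draw test sample $(\mathbf{X}', \mathbf{Y}')$ of size $n$ and define $R'_n(\cdot)$ as the empirical risk over the test sample;
            $\mathbf{A}^* \leftarrow \arg\min_{\mathbf{A}^j \in \{\mathbf{A}^0, \ldots, \mathbf{A}^i\}} R'_n(f_{n,\mathbf{A}^j})$;
            return $f_n = f_{n,\mathbf{A}^*}$;
        end

        Option 2: Automatic stopping
        $\alpha(n) \leftarrow (\log^2 n) \log\log(n/\delta) + \log(1/\delta)$;
        if level($\mathbf{A}^i$) $\ge \log\left( n \cdot \Delta_n^2(\mathbf{A}^i) / \alpha(n) \Delta_n^2(\mathcal{X}) \right)$ then
            $\mathbf{A}^* \leftarrow \arg\min_{\mathbf{A}^j \in \{\mathbf{A}^{i-1}, \mathbf{A}^i\}} \left( \frac{\alpha(n)}{n} \cdot |\mathbf{A}^j| + \Delta_n^2(\mathbf{A}^j) \right)$;
            return $f_n = f_{n,\mathbf{A}^*}$;
        end
    end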



Theorem 4 Assume that $\mathcal{X}$ has Assouad dimension $d$. There
exist constants $C, C'$ independent of $d$ and $\mu(\mathcal{X})$, such that
the following holds.

Suppose the automatic stopping option is used. Define

$$\alpha(n) = (\log^2 n) \log\log(n/\delta) + \log(1/\delta).$$

With probability at least $1 - \delta$, the algorithm attains a diameter
decrease rate of $k \le C' d \log d$, and the excess risk of
the regressor satisfies

$$\|f_n - f\|^2 \le C \cdot (\Delta_\mathcal{Y}^2 + \lambda^2)(\Delta_\mathcal{X}^2 + 1) \left( \frac{\alpha(n)}{n} \right)^{2/(2+k)}.$$



Analysis outline 

We start in Section 3 by laying out the necessary tools for the
rest of the analysis.

The theorems are then proved in two parts. First we
bound the excess risk of the algorithm in terms of the observed
diameter decrease rates in Section 4 (Lemma 13 for
the cross-validation option, and Lemma 15 for the automatic
stopping). We subsequently argue that these decrease rates
depend just on the intrinsic dimensionality of the data
(Corollary 17 of Section 5).

Theorem 3 results from Lemma 13 and Corollary 17, while
Theorem 4 results from Lemma 15 and Corollary 17.

3 Proof preliminaries: risk bound for $f_{n,\mathbf{A}}$

In this section we develop the necessary tools to bound the
excess risk of $f_{n,\mathbf{A}}$, where $\mathbf{A}$ is an RPtree partition, i.e. $\mathbf{A}$ is
defined by the leaves of some subtree of the tree returned by
adaptiveRPtree.

3.1 Generic decomposition of excess risk 

We start the analysis with a standard decomposition of the 
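excess risk into bias and variance terms. Let $\mathbf{A}$ be any partition
of $\mathcal{X}$. The following function of $x \in \mathcal{X}$ provides a
bridge between the regressor $f_{n,\mathbf{A}}$ and the regression function $f$:

$$\tilde{f}_{n,\mathbf{A}}(x) = \mathbb{E}_{\mathbf{Y} \mid \mathbf{X}}\, f_{n,\mathbf{A}}(x) = \frac{\sum_{i=1}^n f(X_i)\, \mathbf{1}\{X_i \in A(x)\}}{n \mu_n(A(x))},$$

if $\mu_n(A(x)) \neq 0$, otherwise we set $\tilde{f}_{n,\mathbf{A}}(x) = y_0 \in \mathcal{Y}$.
The pointwise excess risk can be bounded as

$$\|f_{n,\mathbf{A}}(x) - f(x)\|^2 \le 2 \|f_{n,\mathbf{A}}(x) - \tilde{f}_{n,\mathbf{A}}(x)\|^2 + 2 \|\tilde{f}_{n,\mathbf{A}}(x) - f(x)\|^2. \qquad (1)$$

We therefore proceed by bounding each term on the r.h.s. separately
in the following two lemmas.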

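Lemma 5 (Variance) Let $\mathbf{A}$ be a partition of $\mathcal{X}$. The following
inequality holds for all $x \in \mathcal{X}$ s.t. $\mu_n(A(x)) > 0$,
with probability at least $1 - \delta'$ over the random choice of $\mathbf{Y}$
for $\mathbf{X}$ fixed:

$$\|f_{n,\mathbf{A}}(x) - \tilde{f}_{n,\mathbf{A}}(x)\|^2 \le \Delta_\mathcal{Y}^2\, \frac{2 + \ln(|\mathbf{A}|/\delta')}{n \mu_n(A(x))}. \qquad (2)$$

Proof: Fix $\mathbf{X}$. Now fix $A \in \mathbf{A}$, and let $x \in A$. We'll
consider $\mathbf{Y}_A = \{Y_i \in \mathbf{Y} \text{ s.t. } X_i \in A\}$. Write:

$$\psi(\mathbf{Y}_A) = \|f_{n,\mathbf{A}}(x) - \tilde{f}_{n,\mathbf{A}}(x)\| = \left\| \frac{\sum_{i=1}^n (Y_i - f(X_i))\, \mathbf{1}\{X_i \in A\}}{n \mu_n(A)} \right\|.$$

We can now apply McDiarmid's inequality to $\psi(\cdot)$, as it is
easy to verify that changing one of the $Y$ values in $\mathbf{Y}_A$
changes the value of $\psi(\cdot)$ by at most $\Delta_\mathcal{Y} / (n \mu_n(A))$. We then have
that

$$\psi(\mathbf{Y}_A) \le \mathbb{E}\, \psi(\mathbf{Y}_A) + \Delta_\mathcal{Y} \sqrt{\frac{\ln(|\mathbf{A}|/\delta')}{2 n \mu_n(A)}},$$

with probability at least $1 - \delta'/|\mathbf{A}|$ over the random choice
of $\mathbf{Y}_A$.

The expectation can be bounded as follows:

$$\mathbb{E}\, \psi(\mathbf{Y}_A) \le \left( \mathbb{E}\, \psi^2(\mathbf{Y}_A) \right)^{1/2}
= \left( \mathbb{E} \left\| \frac{\sum_{i=1}^n (Y_i - f(X_i))\, \mathbf{1}\{X_i \in A\}}{n \mu_n(A)} \right\|^2 \right)^{1/2}
\le \left( \frac{\sum_{i=1}^n \mathbb{E} \|Y_i - f(X_i)\|^2\, \mathbf{1}\{X_i \in A\}}{(n \mu_n(A))^2} \right)^{1/2}
\le \left( \frac{n \mu_n(A)\, \Delta_\mathcal{Y}^2}{(n \mu_n(A))^2} \right)^{1/2}
= \frac{\Delta_\mathcal{Y}}{\sqrt{n \mu_n(A)}}.$$

The first inequality above is an application of Jensen's inequality.
The second inequality results from the fact that,
for independent random vectors $v_i$ with null expectation, we
have $\mathbb{E} \|\sum_i v_i\|^2 = \sum_i \mathbb{E} \|v_i\|^2$; here we just take $v_i$ to be
$(Y_i - f(X_i))\, \mathbf{1}\{X_i \in A\} / (n \mu_n(A))$.

Combining the above yields the desired bound on $\psi(\mathbf{Y}_A)$
with probability at least $1 - \delta'/|\mathbf{A}|$. We then conclude with
a union bound over all $A \in \mathbf{A}$. ■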
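
The combining step can be spelled out as follows. Writing $m$ for the number of training points in the cell and $L$ for the logarithmic confidence term (both abbreviations are introduced here only for the example), the deviation and expectation bounds add up, and the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$ gives the constant 2 in the variance bound:

```latex
% Abbreviations (for this example only): m = n \mu_n(A), L = \ln(|\mathbf{A}|/\delta').
\begin{align*}
\psi(\mathbf{Y}_A)
  &\le \frac{\Delta_\mathcal{Y}}{\sqrt{m}} + \Delta_\mathcal{Y} \sqrt{\frac{L}{2m}}, \\
\psi^2(\mathbf{Y}_A)
  &\le 2 \cdot \frac{\Delta_\mathcal{Y}^2}{m} + 2 \cdot \frac{\Delta_\mathcal{Y}^2 L}{2m}
   = \Delta_\mathcal{Y}^2\, \frac{2 + L}{m}.
\end{align*}
```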



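Lemma 6 (Bias) Let $\mathbf{A}$ be a partition of $\mathcal{X}$. The following
inequality holds for all $x \in \mathcal{X}$ s.t. $\mu_n(A(x)) > 0$:

$$\|\tilde{f}_{n,\mathbf{A}}(x) - f(x)\|^2 \le \lambda^2 \Delta^2(A(x)). \qquad (3)$$

Proof: Fix $A \in \mathbf{A}$ and let $x \in A$. Now write

$$\|\tilde{f}_{n,\mathbf{A}}(x) - f(x)\|^2
= \left\| \frac{\sum_{i=1}^n (f(X_i) - f(x))\, \mathbf{1}\{X_i \in A\}}{n \mu_n(A)} \right\|^2
\le \frac{\sum_{i=1}^n \|f(X_i) - f(x)\|^2\, \mathbf{1}\{X_i \in A\}}{n \mu_n(A)}
\le \frac{\sum_{i=1}^n \lambda^2 \|X_i - x\|^2\, \mathbf{1}\{X_i \in A\}}{n \mu_n(A)}
\le \lambda^2 \Delta^2(A),$$

where the second inequality results from the Lipschitz condition
on $f(\cdot)$. ■

Figure 3: (a) Cover $\mathcal{B}$, (b) partition $\mathbf{A}$, (c) partition $\mathbf{A}'$. We start
with a cover $\mathcal{B}$ of $\mathcal{X}$ with balls of different
size, next we see the data and obtain a partition $\mathbf{A}$, we then
substitute $\mathbf{A}$ with $\mathbf{A}'$ by intersecting the cells of $\mathbf{A}$ with balls
of $\mathcal{B}$.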

In Lemma 6 above, the bias is bounded in terms of the
physical diameters $\Delta(A)$. However, for an RPtree partition
$\mathbf{A}$ (i.e. $\mathbf{A}$ is defined by the leaves of some subtree), the physical
diameters $\{\Delta(A), A \in \mathbf{A}\}$ could be as large as $\Delta_\mathcal{X}$, the
diameter of the whole space. As previously discussed, RPtree
focuses on decreasing the data diameters $\Delta_n(A)$, and
we'll argue that this is sufficient to decrease the bias of the
estimator. For this purpose, we will replace RPtree partitions
$\mathbf{A}$ with alternate partitions $\mathbf{A}'$ as explained in the next section.

3.2 Alternate partitions 

Given a partition $\mathbf{A}$ built by RPtree, we will consider an alternate
partition $\mathbf{A}'$ which will serve to analyze the bias of
the regressor $f_{n,\mathbf{A}}$ (see the above discussion of Lemma 6). Each
cell of $\mathbf{A}'$ will either contain no data point, or have physical
diameter roughly the same as its data diameter. This is done
by intersecting the cells of $\mathbf{A}$ with balls or complements of
balls from a fixed collection $\mathcal{B}$ defined below (see Figure 3).
We'll see that $\mathbf{A}'$ approximately maintains key properties of
$\mathbf{A}$, namely partition size and average data diameters.

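Definition 7 We define $\mathcal{B}$ as the following collection of balls
in $\mathbb{R}^D$. Let $I = \lfloor \log n^{2/(2+d)} \rfloor$. For each $i = 0$ to $I$, consider
a minimal $(2^{-i} \Delta_\mathcal{X})$-cover of $\mathcal{X}$; let $\mathcal{B}_i$ be the set of all balls
$B(z, 2^{-(i-2)} \Delta_\mathcal{X})$ centered at points $z$ in the cover. We set
$\mathcal{B} = \bigcup_{i=0}^{I} \mathcal{B}_i$.

Every cell $A \in \mathbf{A}$ such that $A \cap \mathbf{X} \neq \emptyset$ will be replaced in
$\mathbf{A}'$ by two cells $A'_1, A'_2$ obtained as follows.

Consider the smallest $i \in \{0, \ldots, I\}$ such that $2^{-i} \Delta_\mathcal{X} \le
\max\{\Delta_n(A), 2^{-I} \Delta_\mathcal{X}\}$, i.e. $i = \min\{I, \lceil \log(\Delta_\mathcal{X} / \Delta_n(A)) \rceil\}$.

There exists a ball $B \in \mathcal{B}_i$ which covers $A \cap \mathbf{X}$: pick any $x \in
A \cap \mathbf{X}$, and pick the ball $B$ in $\mathcal{B}_i$ whose center $z$ is closest to
$x$; we have $\forall x' \in A \cap \mathbf{X}$ that $x' \in B = B(z, 2^{-(i-2)} \Delta_\mathcal{X})$,
since by a triangle inequality

$$\|z - x'\| \le \|z - x\| + \|x - x'\| \le 2^{-i} \Delta_\mathcal{X} + \Delta_n(A)
\le 2^{-i} \Delta_\mathcal{X} + 2^{-(i-1)} \Delta_\mathcal{X} \le 2^{-(i-2)} \Delta_\mathcal{X}.$$

We define $A'_1 = B \cap A$ and $A'_2 = A \setminus A'_1$ for all $A \in
\mathbf{A}$, $A \cap \mathbf{X} \neq \emptyset$; on the other hand we let $A'_1 = A$, $A'_2 = \emptyset$
for all $A \in \mathbf{A}$, $A \cap \mathbf{X} = \emptyset$. We finally define $\mathbf{A}'$ to be the
collection of all such $A'_1, A'_2$ over $A \in \mathbf{A}$.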

In the following lemma we relate diameters of cells of 
A' to the data diameters of cells of A. 

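Lemma 8 (Diameters of $\mathbf{A}'$) Let $\mathbf{A}$ be some partition of $\mathcal{X}$
and let $\mathbf{A}'$ be as defined above. We have that

$$\sum_{A' \in \mathbf{A}'} \mu_n(A') \Delta^2(A') \le 64 \Delta_n^2(\mathbf{A}) + 256 n^{-4/(2+d)} \cdot \Delta_\mathcal{X}^2.$$

Proof: Let $A \in \mathbf{A}$, $A \cap \mathbf{X} \neq \emptyset$. We have $\mu_n(A'_1) = \mu_n(A)$
and $\mu_n(A'_2) = 0$. Also, given the smallest $i \in \{0, \ldots, I\}$
such that $2^{-i} \Delta_\mathcal{X} \le \max\{\Delta_n(A), 2^{-I} \Delta_\mathcal{X}\}$, we have that

• $\Delta_n(A) \ge 2^{-I} \Delta_\mathcal{X}$ implies $\Delta(A'_1) \le 2 \cdot 2^{-(i-2)} \Delta_\mathcal{X} \le 8 \Delta_n(A)$,

• $\Delta_n(A) < 2^{-I} \Delta_\mathcal{X}$ implies $\Delta(A'_1) \le 2 \cdot 2^{-(I-2)} \Delta_\mathcal{X} \le 16 n^{-2/(2+d)} \cdot \Delta_\mathcal{X}$.

Therefore, letting $\mathbf{A}^+ = \{A \in \mathbf{A}, \Delta_n(A) \ge 2^{-I} \Delta_\mathcal{X}\}$, we have

$$\sum_{A' \in \mathbf{A}'} \mu_n(A') \Delta^2(A')
= \sum_{A \in \mathbf{A}^+} \mu_n(A) \Delta^2(A'_1) + \sum_{A \in \mathbf{A} \setminus \mathbf{A}^+} \mu_n(A) \Delta^2(A'_1)$$
$$\le \sum_{A \in \mathbf{A}^+} 64 \mu_n(A) \Delta_n^2(A) + \sum_{A \in \mathbf{A} \setminus \mathbf{A}^+} 256 \mu_n(A)\, n^{-4/(2+d)} \cdot \Delta_\mathcal{X}^2$$
$$\le 64 \Delta_n^2(\mathbf{A}) + 256 n^{-4/(2+d)} \cdot \Delta_\mathcal{X}^2. \qquad ■$$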



In order to bound the integrated excess risk, we'll need 
the empirical mass of cells of A' to be close to their true 
mass. In particular, this will allow us to effectively discard 
cells that are empty of data since they will have little effect 
on the integrated excess risk. The following lemma from VC 
theory will come in handy. 



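Lemma 9 (Relative VC bounds, [VC71]) Let $\mathcal{C}$ be a class
of subsets of $\mathbb{R}^D$, and let its $2n$-shatter coefficient be given
by $\mathcal{S}(\mathcal{C}, 2n)$. With probability at least $1 - \delta'$ over the choice
of $\mathbf{X}$, all $A' \in \mathcal{C}$ satisfy

$$\mu(A') \le \mu_n(A') + 2 \sqrt{\mu_n(A')\, \frac{\ln \mathcal{S}(\mathcal{C}, 2n) + \ln(4/\delta')}{n}} + 4\, \frac{\ln \mathcal{S}(\mathcal{C}, 2n) + \ln(4/\delta')}{n}.$$

The next lemma establishes the convergence of empirical
masses of cells of $\mathbf{A}'$.

Lemma 10 (Mass of cells of $\mathbf{A}'$) With probability at least $1 - \delta'$
over $\mathbf{X}$ and the randomness in the algorithm, we have for
all RPtree partitions $\mathbf{A}$, for all $A' \in \mathbf{A}'$ that

$$\mu(A') \le \mu_n(A') + 2 \sqrt{\mu_n(A')\, \frac{V + \ln(4/\delta')}{n}} + 4\, \frac{V + \ln(4/\delta')}{n}, \qquad (4)$$

where $V \le O(\log n)(\log n + \log\log(1/\delta))$.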



Proof: Suppose w.l.o.g. that the RPtree is built by picking
random directions from a fixed collection $\mathcal{V}$ without replacement.
How big should $\mathcal{V}$ be so we have enough directions to
choose from? The implementation of coreRPtree ensures
that $|\mathcal{V}| \le 2 n^6 \log(6n^2/\delta)$ is sufficient (see Remark 4 of
Section 5.1). Now fix such a collection $\mathcal{V}$ and let $\mathcal{H}_\mathcal{V}$ be the
union of $\{\mathcal{X}\}$ and the class of half spaces of $\mathbb{R}^D$ defined by
hyperplanes normal to the directions in $\mathcal{V}$. For an RPtree partition
$\mathbf{A}$, each cell of $\mathbf{A}$ is the intersection of at most $6 \log n$
elements of $\mathcal{H}_\mathcal{V}$ since the tree is guaranteed to have height at
most $6 \log n$ (Remark 4). Each cell of $\mathbf{A}'$ is the intersection
of a ball or the complement of a ball in $\mathcal{B}$ with a cell of $\mathbf{A}$.

All such cells therefore belong to the following class of
subsets of $\mathbb{R}^D$:

$$\mathcal{C} = \left\{ \bigcap_{i=0}^{6 \log n} h_i : h_0 \text{ or } h_0^c \text{ is in } \mathcal{B},\ h_i \in \mathcal{H}_\mathcal{V} \text{ for } i \ge 1 \right\}.$$

We now proceed to bounding $\mathcal{S}(\mathcal{C}, 2n)$, the $2n$-shatter
coefficient of $\mathcal{C}$, as follows.

Given $2n$ sample points, every direction $v \in \mathcal{V}$ defines
at most $2(2n + 1)$ equivalent choices of half-spaces in $\mathbb{R}^D$.
We therefore have

$$\mathcal{S}(\mathcal{C}, 2n) \le 2 |\mathcal{B}| \left( (4n + 2) |\mathcal{V}| + 1 \right)^{6 \log n}
\le 2 |\mathcal{B}| \left( n^6 (8n + 4) \log(6n^2/\delta) + 1 \right)^{6 \log n}.$$

Since $\mathcal{X}$ has Assouad dimension $d$, we have $|\mathcal{B}| \le \sum_{i=0}^{I} 2^{id} \le
2 n^{2d/(2+d)}$. The proof is completed by letting $V = \ln \mathcal{S}(\mathcal{C}, 2n)$
for $\mathcal{V}$ fixed, and calling on Lemma 9. ■

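Lemma 11 (Excess risk) There exists a constant $C_1$ independent
of $d$ and $\mu(\mathcal{X})$ such that the following holds with
probability at least $1 - \delta/3$ over the choice of $(\mathbf{X}, \mathbf{Y})$ and
the randomness in the algorithm.

Define $\alpha(n) = (\log^2 n) \log\log(1/\delta) + \log(1/\delta)$. Let $\mathbf{A}^i$
be the final partition reached by adaptiveRPtree. For all
partitions $\mathbf{A} \in \{\mathbf{A}^j\}_{j=0}^{i}$, we have

$$\|f_{n,\mathbf{A}} - f\|^2 \le C_1 \left( \Delta_\mathcal{Y}^2\, |\mathbf{A}|\, \frac{\alpha(n)}{n} + \lambda^2 \left( \Delta_n^2(\mathbf{A}) + n^{-4/(2+d)} \Delta_\mathcal{X}^2 \right) \right).$$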



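Proof: Let the partition $\mathbf{A} \in \{\mathbf{A}^j\}_{j=0}^{i}$ and the sample $\mathbf{X}$
be fixed. By Lemma 10 we have, with probability at least
$1 - \delta'$, that equation (4) holds for all $A' \in \mathbf{A}'$ with $V \le
O(\log n)(\log n + \log\log(1/\delta))$.

The excess risk decomposes over $\mathbf{A}'$ as

$$\|f_{n,\mathbf{A}} - f\|^2 = \sum_{A' \in \mathbf{A}'} \int_{A'} \|f_{n,\mathbf{A}}(x) - f(x)\|^2 \mu(dx).$$

We next divide the cells of $\mathbf{A}'$ into two groups:

$$\mathbf{A}'_> = \left\{ A' \in \mathbf{A}' : \mu_n(A') \ge \frac{V + \ln(4/\delta')}{n} \right\} \quad \text{and} \quad \mathbf{A}'_< = \mathbf{A}' \setminus \mathbf{A}'_>.$$

It's easy to see that from equation (4), we have $\forall A' \in
\mathbf{A}'_>$, $\mu(A') \le 7 \mu_n(A')$, and $\forall A' \in \mathbf{A}'_<$, $\mu(A') \le 7\, \frac{V + \ln(4/\delta')}{n}$.
Integrating over $\mathbf{A}'_<$, we have

$$\sum_{A' \in \mathbf{A}'_<} \int_{A'} \|f_{n,\mathbf{A}}(x) - f(x)\|^2 \mu(dx)
\le \sum_{A' \in \mathbf{A}'_<} \Delta_\mathcal{Y}^2\, \mu(A')
\le \sum_{A' \in \mathbf{A}'_<} \Delta_\mathcal{Y}^2 \cdot 7\, \frac{V + \ln(4/\delta')}{n}
\le 7 \Delta_\mathcal{Y}^2 \cdot |\mathbf{A}'| \cdot \frac{V + \ln(4/\delta')}{n}. \qquad (5)$$

For the integration over $\mathbf{A}'_>$, we first apply (1), and recall
Lemmas 6 and 5 to have that with probability at least $1 - \delta'$
over $\mathbf{Y}$,

$$\sum_{A' \in \mathbf{A}'_>} \int_{A'} \|f_{n,\mathbf{A}}(x) - f(x)\|^2 \mu(dx)
= \sum_{A' \in \mathbf{A}'_>} \int_{A'} \|f_{n,\mathbf{A}'}(x) - f(x)\|^2 \mu(dx)$$
$$\le \sum_{A' \in \mathbf{A}'_>} 2 \lambda^2 \Delta^2(A') \cdot \mu(A')
+ \sum_{A' \in \mathbf{A}'_>} 2 \Delta_\mathcal{Y}^2\, \frac{2 + \ln(|\mathbf{A}'|/\delta')}{n \mu_n(A')} \cdot \mu(A')$$
$$\le \sum_{A' \in \mathbf{A}'_>} 2 \lambda^2 \Delta^2(A') \cdot 7 \mu_n(A')
+ \sum_{A' \in \mathbf{A}'_>} 2 \Delta_\mathcal{Y}^2\, \frac{2 + \ln(|\mathbf{A}'|/\delta')}{n \mu_n(A')} \cdot 7 \mu_n(A')$$
$$\le 14 \lambda^2 \sum_{A' \in \mathbf{A}'_>} \mu_n(A') \Delta^2(A')
+ 14 \Delta_\mathcal{Y}^2\, |\mathbf{A}'| \cdot \frac{2 + \ln(|\mathbf{A}'|/\delta')}{n}. \qquad (6)$$

Note that the term $\ln |\mathbf{A}'|$ in (6) is at most $O(\ln n)$ since
the entire tree has height at most $6 \log n$. Combining the
bounds in (5) and (6), we get that there exists a constant $C_0$
such that $\|f_{n,\mathbf{A}} - f\|^2$ is at most

$$C_0 \left( \Delta_\mathcal{Y}^2 \cdot |\mathbf{A}|\, \frac{\log^2 n \cdot \log\log(1/\delta) + \log(1/\delta')}{n} + \lambda^2 \sum_{A' \in \mathbf{A}'} \mu_n(A') \Delta^2(A') \right)$$

with probability at least $1 - 2\delta'$.

Setting $\delta' = \delta / (36 \log n)$, the lemma follows by a union
bound over at most $6 \log n$ partitions in $\{\mathbf{A}^j\}_{j=0}^{i}$, and then
calling on Lemma 8. ■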

4 Risk of final regressor $f_n = f_{n,\mathbf{A}^*}$

In this section we bound the excess risk of the final regressor
$f_n = f_{n,\mathbf{A}^*}$ in terms of the diameter decrease rate attained
when adaptiveRPtree stops.

To see that the stopping criteria eventually hold, note that
the implementation of coreRPtree ensures that all cells at
some level down the hierarchy have a single data point in
them (see Remark 4). In other words, we have $\Delta_n(\mathbf{A}^i) = 0$
eventually, forcing either stopping criterion to hold.

We now outline the arguments in this section. For simplicity,
assume $\Delta_\mathcal{X}$, $\Delta_\mathcal{Y}$, and $\lambda$ are all 1. Consider some
RPtree partition $\mathbf{A}$ and let $\Delta_n(\mathbf{A}) \approx \zeta$ for some scalar $\zeta$;
we then have $|\mathbf{A}| \lesssim \zeta^{-k}$, where $k$ is the diameter decrease
rate attained by the algorithm. From Lemma 11 above, we
roughly have $\|f_{n,\mathbf{A}} - f\|^2 \lesssim \zeta^{-k}/n + \zeta^2$, and the best
bound is obtained by setting $\zeta \approx n^{-1/(2+k)}$. Provided we
pick an appropriate partition which optimizes $\zeta$, the final
bound would then take the form $\|f_{n,\mathbf{A}^*} - f\|^2 \lesssim n^{-2/(2+k)}$.
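
The choice of scale in this outline comes from balancing the two terms of the bound. As a worked step, with all problem constants set to 1 and logarithmic factors ignored, write $\zeta$ for the target average data diameter and $k$ for the diameter decrease rate:

```latex
\begin{align*}
\frac{\zeta^{-k}}{n} = \zeta^{2}
  \;&\Longleftrightarrow\; \zeta^{2+k} = \frac{1}{n}
  \;\Longleftrightarrow\; \zeta = n^{-1/(2+k)}, \\
\text{so that}\quad \frac{\zeta^{-k}}{n} + \zeta^{2}
  \;&=\; 2\,\zeta^{2} \;=\; 2\, n^{-2/(2+k)} .
\end{align*}
```

The exact minimizer of the sum differs from this balance point only by a constant factor depending on $k$, which does not affect the rate.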

4.1 Risk bound for cross-validation option 

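Lemma 12 (Existence of a good pruning) Suppose the cross-validation
option is used, and adaptiveRPtree attains a
diameter decrease rate of $k$ on $\mathbf{X}$. Define

$$\alpha(n) = (\log^2 n) \log\log(n/\delta) + \log(1/\delta), \quad \text{and} \quad
\zeta = \left( \frac{\Delta_\mathcal{Y}^2\, \alpha(n)}{\lambda^2 \Delta_\mathcal{X}^2\, n} \right)^{1/(2+k)}.$$

Let $n \ge \max\left\{ \left( \frac{\lambda \Delta_\mathcal{X}}{\Delta_\mathcal{Y}} \right)^2, \alpha(n) \right\}$,
and for $i \ge 0$, let $\mathbf{A}^i$ be as defined in adaptiveRPtree. Then
there exists $i_0 \ge 0$ such that $\Delta_n(\mathbf{A}^{i_0}) \le 2\zeta \cdot \Delta_n(\mathcal{X})$ and
$|\mathbf{A}^{i_0}| \le \zeta^{-k}$.

Proof: Let $i \ge 0$. We have by definition that $\Delta_n(\mathbf{A}^i) \le
2^{-i} \Delta_n(\mathcal{X})$, while it follows from the assumption on diameter
decrease rate that level($\mathbf{A}^i$) $\le ki$. Now let $\mathbf{A}^i$ be the
last partition of $\mathcal{X}$ achieved by adaptiveRPtree when the
stopping criteria holds. We have either that $\Delta_n(\mathbf{A}^i) = 0 \le
\zeta \cdot \Delta_n(\mathcal{X})$, or

$$ki \ge \text{level}(\mathbf{A}^i) \ge \log n^2 \ge k \log n^{2/(k+2)} \ge k \log(1/\zeta),$$

implying that $\Delta_n(\mathbf{A}^i) \le 2^{-i} \cdot \Delta_n(\mathcal{X}) \le \zeta \cdot \Delta_n(\mathcal{X})$.

Now, let $j \in \{1, \ldots, i\}$ be the first $j$ such that $\Delta_n(\mathbf{A}^j) \le
\zeta \cdot \Delta_n(\mathcal{X})$. We consider the following two cases:

• Either level($\mathbf{A}^j$) $\le \log \zeta^{-k}$, and we get $|\mathbf{A}^j| \le \zeta^{-k}$.

• Or level($\mathbf{A}^j$) $> \log \zeta^{-k}$, in which case the following
must hold:

  - $\Delta_n(\mathbf{A}^{j-1}) \le 2\zeta \cdot \Delta_n(\mathcal{X})$, since $kj \ge$ level($\mathbf{A}^j$) $>
    k \log(1/\zeta)$, implying that $j - 1 \ge \log(1/2\zeta)$.

  - level($\mathbf{A}^{j-1}$) $\le \log \zeta^{-k}$, for otherwise $j - 1 >
    \log(1/\zeta)$, implying that $\Delta_n(\mathbf{A}^{j-1}) \le \zeta \Delta_n(\mathcal{X})$, which
    contradicts the choice of $j$. It follows that $|\mathbf{A}^{j-1}| \le \zeta^{-k}$.

Thus, either $\mathbf{A}^j$ or $\mathbf{A}^{j-1}$ satisfies the claim. ■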

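Lemma 13 There exists a constant $C$ independent of $d$ and
$\mu(\mathcal{X})$, such that the following holds with probability at least
$1 - 2\delta/3$ over $(\mathbf{X}, \mathbf{Y})$ and the randomness in the algorithm.

Suppose the cross-validation option is used, and procedure
adaptiveRPtree attains a diameter decrease rate of
$k \ge d$ on $\mathbf{X}$. Define

$$\alpha(n) = (\log^2 n) \log\log(n/\delta) + \log(1/\delta),$$

and assume $n \ge \max\{(\lambda \Delta_\mathcal{X} / \Delta_\mathcal{Y})^2, \alpha(n)\}$. The excess
risk of the regressor is then bounded as

$$\|f_n - f\|^2 \le C \cdot (\lambda \Delta_\mathcal{X})^{2k/(2+k)} \left( \frac{\Delta_\mathcal{Y}^2\, \alpha(n)}{n} \right)^{2/(2+k)} + 2 \Delta_\mathcal{Y}^2 \sqrt{\frac{\ln \log n^6 + \ln(3/\delta)}{2n}}.$$

Proof: Let $\mathbf{A}^{i_0}$ be as in Lemma 12, and
$\zeta = \left( \frac{\Delta_\mathcal{Y}^2\, \alpha(n)}{\lambda^2 \Delta_\mathcal{X}^2\, n} \right)^{1/(2+k)}$.
By applying Lemma 11 and then Lemma 12, we have with
probability at least $1 - \delta/3$ that

$$\|f_{n,\mathbf{A}^{i_0}} - f\|^2 \le C_1 \left( \Delta_\mathcal{Y}^2\, |\mathbf{A}^{i_0}|\, \frac{\alpha(n)}{n} + \lambda^2 \left( \Delta_n^2(\mathbf{A}^{i_0}) + n^{-4/(2+d)} \Delta_\mathcal{X}^2 \right) \right)$$
$$\le C_1 \left( \Delta_\mathcal{Y}^2\, \zeta^{-k}\, \frac{\alpha(n)}{n} + 5 \lambda^2 \zeta^2 \Delta_\mathcal{X}^2 \right)
\le C_2 \lambda^2 \Delta_\mathcal{X}^2 \zeta^2.$$

To analyze the cross-validation phase, we first fix the partition
tree and consider the obtained partitions from $\mathbf{A}^0$ to the
final partition $\mathbf{A}^i$ when the stopping criteria holds. We have
with probability at least $1 - \delta/3$ over the choice of $(\mathbf{X}', \mathbf{Y}')$
that $\forall j \in \{0, \ldots, i\}$,

$$\left| R(f_{n,\mathbf{A}^j}) - R'_n(f_{n,\mathbf{A}^j}) \right| \le \Delta_\mathcal{Y}^2 \sqrt{\frac{\ln \log n^6 + \ln(3/\delta)}{2n}}.$$

The above is obtained by applying McDiarmid's inequality to the empirical
risk followed by a union bound over at most $6 \log n$
regressors $f_{n,\mathbf{A}^j}$, $j \in \{0, \ldots, i\}$.

Let $f_n = f_{n,\mathbf{A}^*}$ be the empirical risk minimizer; we can
then conclude that

$$\|f_n - f\|^2 \le C_2 \lambda^2 \Delta_\mathcal{X}^2 \zeta^2 + 2 \Delta_\mathcal{Y}^2 \sqrt{\frac{\ln \log n^6 + \ln(3/\delta)}{2n}}$$

with probability at least $1 - 2\delta/3$. ■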

4.2 Risk bound for automatic stopping option 

Lemma 14 (Properties of A) Suppose the automatic stop- 
ping option is used, and that adaptiveRPtree attains a di- 
ameter decrease rate of k on X. Define 

a(n) = (log 2 n) loglog(n/<5) +log(l/<5), 

/ (n) \ 1/(2+*;) 

and £ = I ) • Finally, assume n > a(n). Then, 

the following holds for the final partition A retained for re- 
gression: 



A|+A 2 (A) <(4A 2 (*) + 1)C 2 . 



a(n) 



Proof: For i > 0, let A 4 as defined in adaptiveRPtree. 
We have by definition that A„ (A 1 ) < 2 _ *A„ (X), while it 
follows from the assumption on diameter decrease rate that 
level (A 4 ) < ki. Now for some i > 1, let A* be the fi- 
nal partition of X achieved by adaptiveRPtree when the 
stopping criteria holds. We consider the following two cases: 

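• Either $\mathrm{level}(\mathbf{A}^i) \le \log\zeta^{-k}$, and we have by the
stopping condition that:

$$\Delta_n^2(\mathbf{A}^i) \le \frac{\alpha(n)}{n}\,2^{\mathrm{level}(\mathbf{A}^i)}\,\Delta_n^2(\mathcal{X}) \le \frac{\alpha(n)}{n}\,\zeta^{-k}\,\Delta_n^2(\mathcal{X}).$$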



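• Or $\mathrm{level}(\mathbf{A}^i) > \log\zeta^{-k}$, in which case the following
must hold:

- $\Delta_n(\mathbf{A}^{i-1}) \le 2\zeta\cdot\Delta_n(\mathcal{X})$, since $ki \ge \mathrm{level}(\mathbf{A}^i) > k\log(1/\zeta)$,
implying that $i - 1 > \log(1/2\zeta)$.

- $\mathrm{level}(\mathbf{A}^{i-1}) \le \log\zeta^{-k}$, for otherwise we would
have stopped at $i - 1$. To see this, assume instead
that $\mathrm{level}(\mathbf{A}^{i-1}) > \log\zeta^{-k}$: we have that
$(i - 1) > \log(1/\zeta)$ and subsequently that

$$\Delta_n^2(\mathbf{A}^{i-1}) \le 2^{-2(i-1)}\Delta_n^2(\mathcal{X}) \le \zeta^2\Delta_n^2(\mathcal{X}) = \frac{\alpha(n)}{n}\,\zeta^{-k}\cdot\Delta_n^2(\mathcal{X}) < \frac{\alpha(n)}{n}\,2^{\mathrm{level}(\mathbf{A}^{i-1})}\cdot\Delta_n^2(\mathcal{X}).$$

In other words,

$$\mathrm{level}(\mathbf{A}^{i-1}) > \log\left(n\,\Delta_n^2(\mathbf{A}^{i-1}) \,/\, \alpha(n)\,\Delta_n^2(\mathcal{X})\right).$$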

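In either case at least one of $\mathbf{A}^i$ and $\mathbf{A}^{i-1}$ has size at
most $\zeta^{-k}$ and diameter at most $2\zeta\cdot\Delta_n(\mathcal{X})$. It follows that

$$\min_{j\in\{i-1,\,i\}} \left(\frac{\alpha(n)}{n}\,|\mathbf{A}^j| + \Delta_n^2(\mathbf{A}^j)\right) \le \frac{\alpha(n)}{n}\,\zeta^{-k} + 4\zeta^2\cdot\Delta_n^2(\mathcal{X}) = \left(4\Delta_n^2(\mathcal{X}) + 1\right)\zeta^2,$$

which concludes the argument.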
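To make the two cases of the proof concrete, the following Python sketch evaluates $\alpha(n)$ and $\zeta$ and runs the stopping rule on an idealized tree. It assumes base-2 logarithms, that the $i$-th partition has exactly $2^{ki}$ cells with data diameter exactly $2^{-i}$ times that of the root, and that the stopping rule is the one used in the proof (stop once the squared diameter is at most $\frac{\alpha(n)}{n}$ times the number of cells times the squared root diameter). All function names are illustrative.

```python
import math

def alpha(n, delta):
    # alpha(n) = (log^2 n) * log log(n/delta) + log(1/delta); base-2 logs are an assumption.
    return math.log2(n) ** 2 * math.log2(math.log2(n / delta)) + math.log2(1 / delta)

def zeta(n, delta, k):
    # zeta = (alpha(n)/n)^(1/(2+k)), as in the statement of Lemma 14.
    return (alpha(n, delta) / n) ** (1.0 / (2 + k))

def simulate_stopping(n, delta, k, root_diam=1.0):
    """Idealized run with level(A^i) = k*i and diameter(A^i) = 2^-i * root_diam."""
    ratio = alpha(n, delta) / n
    assert ratio < 1, "the lemma assumes n exceeds alpha(n)"
    costs = []  # costs[i] = (alpha/n) * |A^i| + diameter(A^i)^2
    i = 0
    while True:
        size = 2 ** (k * i)
        diam_sq = (root_diam * 2.0 ** (-i)) ** 2
        costs.append(ratio * size + diam_sq)
        if diam_sq <= ratio * size * root_diam ** 2:  # stopping condition
            break
        i += 1
    best = min(costs[max(i - 1, 0):])  # better of the last two partitions
    bound = (4 * root_diam ** 2 + 1) * zeta(n, delta, k) ** 2
    return i, best, bound

if __name__ == "__main__":
    for k in (2, 5, 10):
        i, best, bound = simulate_stopping(10**6, 0.05, k)
        print(f"k={k}: stopped at i={i}, cost={best:.4f}, bound={bound:.4f}")
```

In this idealized setting the better of the last two partitions always stays below $(4\Delta_n^2(\mathcal{X}) + 1)\zeta^2$, which is the quantity the lemma controls.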



Lemma 15 There exists a constant $C$ independent of $d$ and
$\rho(\mathcal{X})$, such that the following holds with probability at least
$1 - \delta/3$ over $(\mathbf{X}, \mathbf{Y})$ and the randomness in the algorithm.

Suppose the automatic stopping option is used; assume
adaptiveRPtree attains a diameter decrease rate of $k \ge d$
on $\mathcal{X}$. Define $\alpha(n) = (\log^2 n)\log\log(n/\delta) + \log(1/\delta)$. The
excess risk of the regressor is then bounded as

$$\|f_n - f\|^2 \le C\left(\Delta_{\mathcal{Y}}^2 + \lambda^2\right)\left(\Delta_{\mathcal{X}}^2 + 1\right)\left(\frac{\alpha(n)}{n}\right)^{2/(2+k)}.$$



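Proof: For $n < \alpha(n)$, the bound on the excess risk holds
vacuously. We assume henceforth that $n \ge \alpha(n)$. Let
$\zeta = \left(\frac{\alpha(n)}{n}\right)^{1/(2+k)}$. First applying lemma 11, then lemma 14,
we have with probability at least $1 - \delta/3$ that

$$\begin{aligned}
\|f_n - f\|^2
 &\le C_1\left(\Delta_{\mathcal{Y}}^2\,\frac{\alpha(n)}{n}\,|\mathbf{A}| + \lambda^2\left(\Delta_n^2(\mathbf{A}) + n^{-4/(2+d)}\Delta_{\mathcal{X}}^2\right)\right) \\
 &\le C_1\left(\Delta_{\mathcal{Y}}^2 + \lambda^2\right)\left(\frac{\alpha(n)}{n}\,|\mathbf{A}| + \left(\Delta_n^2(\mathbf{A}) + n^{-4/(2+d)}\Delta_{\mathcal{X}}^2\right)\right) \\
 &\le C_1\left(\Delta_{\mathcal{Y}}^2 + \lambda^2\right)\left(\left(4\Delta_{\mathcal{X}}^2 + 1\right)\zeta^2 + \zeta^2\Delta_{\mathcal{X}}^2\right) \\
 &\le C\left(\Delta_{\mathcal{Y}}^2 + \lambda^2\right)\left(\Delta_{\mathcal{X}}^2 + 1\right)\zeta^2,
\end{aligned}$$

which concludes the argument. ■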
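As a rough numerical illustration of the lemma, the Python sketch below evaluates only the rate factor $(\alpha(n)/n)^{2/(2+k)}$, leaving out the unspecified constant $C$ and the diameter and Lipschitz factors. Base-2 logarithms and the helper `samples_needed` are assumptions made for the example.

```python
import math

def alpha(n, delta):
    # alpha(n) = (log^2 n) * log log(n/delta) + log(1/delta); base-2 logs are an assumption.
    return math.log2(n) ** 2 * math.log2(math.log2(n / delta)) + math.log2(1 / delta)

def rate_factor(n, delta, k):
    """(alpha(n)/n)^(2/(2+k)); informative only when n exceeds alpha(n)."""
    return (alpha(n, delta) / n) ** (2.0 / (2 + k))

def samples_needed(target, delta, k, n_max=2**60):
    """Smallest power of two n whose rate factor is at most `target` (hypothetical helper)."""
    n = 2
    while n <= n_max:
        if alpha(n, delta) < n and rate_factor(n, delta, k) <= target:
            return n
        n *= 2
    return None  # not reached within n_max

if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        print(k, samples_needed(0.1, 0.05, k))
```

The sample size needed for a fixed target grows quickly with the decrease rate $k$, but the ambient dimension $D$ does not enter the computation at all.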

5 Core RPtree and diameter decrease rates 

5.1 Core RPtree procedures



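Procedure basicRPtree($A_0 \subset \mathcal{X}$, $\Delta$, level $l$)

    $\mathbf{A}_0 \leftarrow \{A_0\}$;
    for $i \leftarrow 1$ to $\infty$ do
        if $\Delta_n(\mathbf{A}_{i-1}) \le \Delta$ then
            return;
        end
        Choose a random direction $v \sim \mathcal{N}(0, \frac{1}{D} I_D)$;
        Choose a random $r \sim \mathcal{U}[-1, 1] \cdot \frac{6}{\sqrt{D}}\Delta_n(A_0)$;
        foreach cell $A \in \mathbf{A}_{i-1}$ do
            if $(l + i)$ is odd then
                // Noisy splits.
                $t \leftarrow \mathrm{median}\{z^\top v : z \in \mathbf{X} \cap A_0\} + r$;
            else
                // Median splits.
                $t \leftarrow \mathrm{median}\{z^\top v : z \in \mathbf{X} \cap A\}$;
            end
            $A_{\mathrm{left}} \leftarrow \{x \in A,\ x^\top v \le t\}$;
            $A_{\mathrm{right}} \leftarrow A \setminus A_{\mathrm{left}}$;
        end
        $\mathbf{A}_i \leftarrow$ partition of $A_0$ defined by the leaves of the current tree;
    end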



Procedure coreRPtree($A_0 \subset \mathcal{X}$, $\Delta$, $\delta$, level $l$)

    Call basicRPtree($A_0$, $\Delta$, $l$) $\log(6n^2/\delta)$ times
    and return the shortest tree.



RPtree consists of hierarchically bisecting the data space
with random hyperplanes. In basicRPtree we alternate between
two types of bisections: we split exactly at the median
in order to balance the tree, and we split at the median plus
noise to improve the rate at which the data diameters are
reduced down the tree. Notice that for the "noisy" split we use
the same hyperplane to bisect all nodes $A \in \mathbf{A}_{i-1}$.

The procedure coreRPtree serves to boost the probability
that we get a small tree. The many calls to basicRPtree
can be run in parallel, so that we stop growing the
trees that are to be discarded as soon as the smallest tree is
identified.
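The sketch below is one possible reading of basicRPtree and coreRPtree in Python, not the authors' implementation. It assumes that the data diameter of a partition is the largest pairwise distance within any of its cells, that the noisy offset is scaled by $6/\sqrt{D}$ times the root diameter, and that empty cells can simply be dropped. The parameter `max_levels` is a hypothetical safety cap, and the calls are run sequentially rather than in parallel.

```python
import math
import random
import statistics

def diameter(points):
    """Data diameter of a finite point set (0 for fewer than two points)."""
    return max((math.dist(p, q) for p in points for q in points), default=0.0)

def basic_rptree(points, target, level=0, rng=random, noise_const=6.0, max_levels=200):
    """Split until every cell has data diameter <= target; return (cells, depth)."""
    dim = len(points[0])
    root_diam = diameter(points)
    cells = [list(points)]
    for i in range(1, max_levels + 1):
        if max(diameter(c) for c in cells) <= target:
            return cells, i - 1
        # Random direction v ~ N(0, I/dim) and random offset r for the noisy split.
        v = [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
        r = rng.uniform(-1.0, 1.0) * noise_const / math.sqrt(dim) * root_diam

        def proj(z):
            return sum(a * b for a, b in zip(z, v))

        noisy = (level + i) % 2 == 1
        shared_t = statistics.median(proj(z) for z in points) + r
        next_cells = []
        for cell in cells:
            # Noisy levels reuse one threshold for all cells; median levels split each cell.
            t = shared_t if noisy else statistics.median(proj(z) for z in cell)
            left = [x for x in cell if proj(x) <= t]
            right = [x for x in cell if proj(x) > t]
            next_cells.extend(c for c in (left, right) if c)
        cells = next_cells
    raise RuntimeError("max_levels reached before the target diameter")

def core_rptree(points, target, delta, level=0, rng=random):
    """Run basic_rptree about log2(6 n^2 / delta) times; keep the shallowest result."""
    n = len(points)
    repeats = math.ceil(math.log2(6 * n * n / delta))
    runs = (basic_rptree(points, target, level, rng) for _ in range(repeats))
    return min(runs, key=lambda run: run[1])
```

A call such as `core_rptree(points, diameter(points) / 2, delta=0.1)` on a list of equal-length tuples returns the leaf cells and the depth of the shallowest tree found. The quadratic-time `diameter` keeps the example short; it is only practical for small samples.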

Remark 4 Given the implementation of coreRPtree, the
tree returned by adaptiveRPtree has the following properties:

• Any node at level $6\log n$ has at most 1 data point: the
data is split at the exact median at every other level,
so the number of points per node decreases exponentially
from the root down. If $n$ were a power of 2,
we would need at most $2\log n$ levels to get to 1 point
per node. For general $n$, notice that the number of
points in a node at level $i \ge 2$ is at most $\frac{3}{4}$ of that of
its ancestor at level $i - 2$. In other words, we need at
most $2\log n/\log(4/3) \le 6\log n$ levels to get down to
1 point per node.

• As a consequence, the entire tree reaches depth at most
$6\log n$ under either stopping criterion, and therefore has
at most $2n^6$ nodes.

• Another consequence is that at most $2n^6\log(6n^2/\delta)$
random directions are required to build the entire tree.
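The depth count in the first item can be checked numerically. The sketch below assumes base-2 logarithms and a worst case in which the larger side of each exact median split keeps $\lceil m/2 \rceil$ of the $m$ points, while the intervening noisy level removes none. The function name is illustrative.

```python
import math

def levels_until_singletons(n):
    """Levels needed before every node holds at most one point, when only
    every other level (the exact median split) is guaranteed to shrink a node."""
    points, levels = n, 0
    while points > 1:
        points = math.ceil(points / 2)  # larger side of an exact median split
        levels += 2                     # one noisy level plus one median level
    return levels

if __name__ == "__main__":
    for n in (2, 3, 100, 10**6):
        print(n, levels_until_singletons(n), 6 * math.log2(n))
```

Under these assumptions the count equals $2\lceil\log n\rceil$, which stays below $6\log n$ for every $n \ge 2$; a binary tree of that depth has fewer than $2n^6$ nodes.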

5.2 Worst case decrease rates 

In this section we consider worst case bounds for the diameter
decrease rates attainable by the algorithm over supports
of low intrinsic dimension.

The following theorem, adapted from Dasgupta and Freund
[DF08], is the core of the argument.

Theorem 16 Let $A \subset \mathbb{R}^D$ and suppose $A \cap \mathcal{X}$ has Assouad
dimension $d$. There exists a constant $C'$ independent of the
sample $\mathbf{X}$ and $d$, with the following property. We have with
probability at least $\frac{1}{2}$ that the tree rooted at $A$ returned by
the call basicRPtree($A$, $\Delta_n(A)/2$, $l$) has depth at most
$C' d\log d$.

Proof Idea: The proof is a direct consequence of lemma 9
of [DF08] applied to the "noisy" splits at alternating levels
in procedure basicRPtree.

Let $r = \Delta_n(A)/(512\sqrt{d})$ and consider an $r$-cover of $A$;
now consider pairs of balls $B = B(z, r)$, $B' = B(z', r)$,
where $z, z'$ are in the cover and $\|z - z'\| \ge \frac{1}{2}\Delta_n(A) - 2r$.
Notice that basicRPtree stops if for all such pairs, no leaf
of the tree contains points from both $B \cap \mathbf{X}$ and $B' \cap \mathbf{X}$.




Figure 4: Hilbert space filling curve, balls of smaller radius 
have lower Assouad dimension. 

Fix such a pair $B$ and $B'$. By lemma 9 of [DF08], every
"noisy" split has a constant probability of separating $B \cap \mathbf{X}$
and $B' \cap \mathbf{X}$. Thus, the probability that some cell at level $i$
contains points from both $B \cap \mathbf{X}$ and $B' \cap \mathbf{X}$ goes down
exponentially with $i$. A union bound over at most $O(d)^d$
such pairs yields the theorem. ■

Corollary 17 Suppose $\mathcal{X}$ has Assouad dimension $d$. Let
$C'$ be as in theorem 16. Fix $\mathbf{X}$. With probability at least
$1 - \delta/3$ over the randomness in the algorithm, the procedure
adaptiveRPtree attains a diameter decrease rate of
$k \le C' d\log d$ on $\mathcal{X}$.

Proof: Consider a subtree rooted at $A$ returned by the call
coreRPtree($A$, $\Delta_n(A)/2$, $\delta$, $l$) in the second loop of procedure
adaptiveRPtree. Since $\mathcal{X}$ has Assouad dimension
$d$, $A \cap \mathcal{X}$ also has Assouad dimension $d$ by definition, so
theorem 16 holds.

Procedure coreRPtree calls basicRPtree as many as
$\log(6n^2/\delta)$ times and returns the smallest tree; thus the
probability that the subtree rooted at $A$ has depth over $C' d\log d$
is at most $\delta/6n^2$. Now, under both stopping conditions,
coreRPtree is only called on nodes at level at most $\log n^2$;
a union bound over all such nodes (at most $2n^2$) yields a
probability of failure at most $\delta/3$. ■
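The arithmetic of this union bound can be verified directly. The Python sketch below assumes base-2 logarithms and treats the number of calls as the real number $\log(6n^2/\delta)$; rounding it up to an integer only lowers the failure probability. The function name is illustrative.

```python
import math

def corollary_failure_bound(n, delta):
    """Union bound on the probability that some subtree is deeper than C' d log d."""
    calls = math.log2(6 * n**2 / delta)  # basicRPtree calls made by coreRPtree
    per_node = 0.5 ** calls              # every call fails, each with probability <= 1/2
    nodes = 2 * n**2                     # nodes at level at most log(n^2)
    return nodes * per_node

if __name__ == "__main__":
    print(corollary_failure_bound(1000, 0.1), 0.1 / 3)
```

The per-node failure probability is $2^{-\log(6n^2/\delta)} = \delta/6n^2$, so summing over at most $2n^2$ nodes gives exactly $\delta/3$.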

6 Final Remarks 

We have shown in this paper that an RPtree regressor will
perform well in a scenario where the data space $\mathcal{X}$ has low
Assouad dimension $d \ll D$.

Our results are easily extended to other settings. We can
for example consider a scenario where the data has low Assouad
dimension $d$ at small resolution but "fills" up space at
higher resolution. One may think for instance of a Hilbert
space-filling curve, where balls of small enough radius have
low Assouad dimension relative to the entire space (see figure 4).
RPtree in this case would initially decrease diameter
at a slow rate until it arrives at small enough neighborhoods,
at which point the diameter decrease rate speeds up. Even
in this case, the complexity of the data in larger regions of
space has little effect on the final excess risk, provided $n$ is
large enough for the tree to arrive at well populated regions
with sufficiently small diameter.